Exploring Learner Predictions

Learners use features to make predictions, but how those features are used is often not apparent. mlr can estimate the dependence of a learned function on a subset of the feature space using generatePartialPredictionData.

Partial prediction plots reduce the potentially high dimensional function estimated by the learner, and display a marginalized version of this function in a lower dimensional space. For example suppose $Y = f(X) + \epsilon$, where $\mathbb{E}[\epsilon \mid X] = 0$. With pairs $(x, y)$ drawn independently from this statistical model, a learner may estimate $\hat{f}$, which, if $X$ is high dimensional, can be uninterpretable. Suppose we want to approximate the relationship between $Y$ and some subset $X_s$ of $X$ (lower dimensional than $X$: possibly unidimensional). We partition $X$ into two sets, $X_s$ and $X_c$, such that $X_s \cup X_c = X$, where $X_s$ is the subset of $X$ of interest.

The partial dependence of $f$ on $X_s$ is:

$$f_{X_s} = \mathbb{E}_{X_c} f(X_s, X_c)$$

That is, $X_c$ is integrated out. We use the following estimator:

$$\hat{f}_{X_s} = \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(X_s, x_{ic})$$

The individual conditional expectation of an observation can also be estimated using the above algorithm absent the averaging, giving $\hat{f}^{(i)}_{X_s}$. This allows the discovery of features of $\hat{f}$ that may be obscured by an aggregated summary of $\hat{f}$. See Goldstein, Kapelner, Bleich, and Pitkin (2014) for more details and their package ICEbox for the original implementation. The algorithm works for any supervised learner and for classification, regression, and survival tasks.
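As a language-agnostic illustration (not mlr's internals), the estimator above amounts to: for each grid value of the feature of interest, override that feature for every observation, predict, and average. Keeping the per-observation predictions instead of averaging gives the individual conditional expectations. Here `f_hat` is a hypothetical stand-in for any fitted learner, and the data are toy values:

```python
# Sketch of the partial dependence estimator described above.
# `f_hat` stands in for any fitted prediction function.

def partial_dependence(f_hat, rows, feature, grid):
    """Mean prediction at each grid value, with `feature` overridden."""
    pd_curve = []
    for g in grid:
        # ICE values at g: one prediction per observation
        preds = [f_hat({**row, feature: g}) for row in rows]
        pd_curve.append(sum(preds) / len(preds))  # average X_c out
    return pd_curve

# Toy model: f(x1, x2) = 2 * x1 + x2
f_hat = lambda row: 2 * row["x1"] + row["x2"]
rows = [{"x1": 0.0, "x2": 1.0}, {"x1": 1.0, "x2": 3.0}]

print(partial_dependence(f_hat, rows, "x1", [0.0, 1.0, 2.0]))
# the mean of x2 is 2.0, so the curve is [2.0, 4.0, 6.0]
```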

Usage

Generating Partial Predictions

Our implementation, following mlr's visualization pattern, consists of the above mentioned function generatePartialPredictionData, as well as two visualization functions, plotPartialPrediction and plotPartialPredictionGGVIS. The former generates input (objects of class PartialPredictionData) for the latter two.

The first step executed by generatePartialPredictionData is to generate a feature grid for every element of the character vector features passed; each element must be the name of a column in the data argument, which is usually the training data. The feature grid can be generated in several ways. By default, a uniformly spaced grid of length gridsize (default 10) is created from the empirical minimum to the empirical maximum of each feature, but the arguments fmin and fmax may be used to override these empirical bounds (the lengths of fmin and fmax must match the length of features). Alternatively, the feature data can be resampled, either with a bootstrap or by subsampling.
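The default grid rule is easy to state precisely. A minimal sketch, in Python rather than R, of the rule just described (the function name and behavior here are illustrative, not mlr's internals):

```python
def uniform_grid(values, gridsize=10, fmin=None, fmax=None):
    """Evenly spaced grid from the empirical min to the empirical max
    of `values`, unless fmin/fmax override those bounds."""
    lo = min(values) if fmin is None else fmin
    hi = max(values) if fmax is None else fmax
    step = (hi - lo) / (gridsize - 1)
    return [lo + i * step for i in range(gridsize)]

print(uniform_grid([0, 10, 4], gridsize=5))
# [0.0, 2.5, 5.0, 7.5, 10.0]
```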

lrn.classif = makeLearner("classif.ksvm", predict.type = "prob")
fit.classif = train(lrn.classif, iris.task)
iris = getTaskData(iris.task)
pd = generatePartialPredictionData(fit.classif, iris, "Petal.Width")
pd
#> PartialPredictionData
#> Task: iris-example
#> Features: Petal.Width
#> Target: setosa, versicolor, virginica
#> Interaction: FALSE
#> Individual: FALSE
#>    Class Probability Petal.Width
#> 1 setosa   0.1133617    2.500000
#> 2 setosa   0.1016932    2.233333
#> 3 setosa   0.1000598    1.966667
#> 4 setosa   0.1091532    1.700000
#> 5 setosa   0.1406860    1.433333
#> 6 setosa   0.2131172    1.166667

As noted above, $X_s$ does not have to be unidimensional. If it is not, the interaction flag must be set to TRUE. Then the individual feature grids are combined using the Cartesian product, and the estimator above is applied, producing a partial prediction for every combination of unique feature values. If the interaction flag is FALSE (the default), each element of features is treated as a separate, unidimensional $X_s$, and partial predictions are generated for each feature separately. The resulting output when interaction = FALSE has a column for each feature, with NA where the feature was not used in generating the partial predictions.

pd.lst = generatePartialPredictionData(fit.classif, iris, c("Petal.Width", "Petal.Length"), FALSE)
head(pd.lst$data)
#>    Class Probability Petal.Width Petal.Length
#> 1 setosa   0.1133617    2.500000           NA
#> 2 setosa   0.1016932    2.233333           NA
#> 3 setosa   0.1000598    1.966667           NA
#> 4 setosa   0.1091532    1.700000           NA
#> 5 setosa   0.1406860    1.433333           NA
#> 6 setosa   0.2131172    1.166667           NA

tail(pd.lst$data)
#>        Class Probability Petal.Width Petal.Length
#> 55 virginica   0.3386905          NA     4.277778
#> 56 virginica   0.2364844          NA     3.622222
#> 57 virginica   0.1700154          NA     2.966667
#> 58 virginica   0.1774907          NA     2.311111
#> 59 virginica   0.2287907          NA     1.655556
#> 60 virginica   0.2683431          NA     1.000000
pd.int = generatePartialPredictionData(fit.classif, iris, c("Petal.Width", "Petal.Length"), TRUE)
pd.int
#> PartialPredictionData
#> Task: iris-example
#> Features: Petal.Width, Petal.Length
#> Target: setosa, versicolor, virginica
#> Interaction: TRUE
#> Individual: FALSE
#>    Class Probability Petal.Width Petal.Length
#> 1 setosa   0.1307126    2.500000          6.9
#> 2 setosa   0.1158181    2.233333          6.9
#> 3 setosa   0.1110669    1.966667          6.9
#> 4 setosa   0.1160515    1.700000          6.9
#> 5 setosa   0.1316584    1.433333          6.9
#> 6 setosa   0.1575610    1.166667          6.9
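The combined grid used when interaction = TRUE is simply the Cartesian product of the per-feature grids; the estimator is then evaluated once per combination. A sketch with toy grids (not the actual values above):

```python
from itertools import product

# With interaction = TRUE, the per-feature grids are crossed and the
# partial dependence estimator is evaluated at every combination.
petal_width_grid = [0.5, 1.5, 2.5]   # toy grids for illustration
petal_length_grid = [1.0, 6.9]

combined = list(product(petal_width_grid, petal_length_grid))
print(len(combined))  # 3 * 2 = 6 combinations
print(combined[0])    # (0.5, 1.0)
```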

At each step in the estimation of $\hat{f}_{X_s}$, a set of predictions of length $N$ is generated. By default the mean prediction is used. For classification, where predict.type = "prob", this entails the mean class probabilities. However, other summaries of the predictions may be used via the fun argument. For regression and survival tasks the function passed must return either one number or three, and, if the latter, the numbers must be sorted lowest to highest. For classification tasks the function must return one number for each level of the target feature.

As noted, the fun argument can be a function which returns three numbers (sorted low to high) for a regression task. This allows further exploration of relative feature importance. If a feature is relatively important, the bounds are necessarily tighter because the feature accounts for more of the variance of the predictions, i.e., it is "used" more by the learner.

lrn.regr = makeLearner("regr.blackboost")
fit.regr = train(lrn.regr, bh.task)
bh = getTaskData(bh.task)
pd.regr = generatePartialPredictionData(fit.regr, bh, "lstat", fun = median)
pd.regr
#> PartialPredictionData
#> Task: BostonHousing-example
#> Features: lstat
#> Target: medv
#> Interaction: FALSE
#> Individual: FALSE
#>       medv    lstat
#> 1 17.21100 37.97000
#> 2 17.21100 33.94333
#> 3 17.21100 29.91667
#> 4 17.21100 25.89000
#> 5 17.59456 21.86333
#> 6 19.51325 17.83667
pd.ci = generatePartialPredictionData(fit.regr, bh, "lstat", fun = function(x) quantile(x, c(.25, .5, .75)))
pd.ci
#> PartialPredictionData
#> Task: BostonHousing-example
#> Features: lstat
#> Target: medv
#> Interaction: FALSE
#> Individual: FALSE
#>       medv    lstat    lower    upper
#> 1 17.21100 37.97000 14.03404 18.28213
#> 2 17.21100 33.94333 14.03404 18.28213
#> 3 17.21100 29.91667 14.03404 18.28213
#> 4 17.21100 25.89000 14.03404 18.28213
#> 5 17.59456 21.86333 14.31762 18.46952
#> 6 19.51325 17.83667 16.33624 20.50690
pd.classif = generatePartialPredictionData(fit.classif, iris, "Petal.Length", fun = median)
pd.classif
#> PartialPredictionData
#> Task: iris-example
#> Features: Petal.Length
#> Target: setosa, versicolor, virginica
#> Interaction: FALSE
#> Individual: FALSE
#>    Class Probability Petal.Length
#> 1 setosa  0.10847632     6.900000
#> 2 setosa  0.05687223     6.244444
#> 3 setosa  0.03133824     5.588889
#> 4 setosa  0.02133358     4.933333
#> 5 setosa  0.03139629     4.277778
#> 6 setosa  0.06787746     3.622222

As previously mentioned, if the aggregation function is not used, i.e., it is the identity, then the individual conditional expectations $\hat{f}^{(i)}_{X_s}$ are estimated. If individual = TRUE then generatePartialPredictionData returns the partial predictions made for each observation at each point in the prediction grid constructed from the features.

pd.ind.regr = generatePartialPredictionData(fit.regr, bh, "lstat", individual = TRUE)
pd.ind.regr
#> PartialPredictionData
#> Task: BostonHousing-example
#> Features: lstat
#> Target: medv
#> Interaction: FALSE
#> Individual: TRUE
#> Predictions centered: FALSE
#>       medv    lstat idx
#> 1 18.20727 37.97000   1
#> 2 18.20727 33.94333   1
#> 3 18.20727 29.91667   1
#> 4 18.20727 25.89000   1
#> 5 18.35948 21.86333   1
#> 6 20.60585 17.83667   1

The resulting output, particularly the element data in the returned object, has an additional column idx which gives the index of the observation to which the row pertains.

For classification tasks this index references both the class and the observation index.

pd.ind.classif = generatePartialPredictionData(fit.classif, iris, "Petal.Length", individual = TRUE)
pd.ind.classif
#> PartialPredictionData
#> Task: iris-example
#> Features: Petal.Length
#> Target: setosa, versicolor, virginica
#> Interaction: FALSE
#> Individual: TRUE
#> Predictions centered: FALSE
#>    Class Probability Petal.Length      idx
#> 1 setosa   0.2526891          6.9 1.setosa
#> 2 setosa   0.2503856          6.9 2.setosa
#> 3 setosa   0.2524189          6.9 3.setosa
#> 4 setosa   0.2522449          6.9 4.setosa
#> 5 setosa   0.2531258          6.9 5.setosa
#> 6 setosa   0.2529763          6.9 6.setosa

Individual partial predictions can also be centered by the predictions made at all observations for a particular point in the prediction grid created by the features. This is controlled by the argument center, a list of the same length as the features argument containing, for each feature, the value at which to center.

pd.ind.classif = generatePartialPredictionData(fit.classif, iris, "Petal.Length", individual = TRUE,
                                               center = list("Petal.Length" = min(iris$Petal.Length)))
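Centering amounts to subtracting, from each observation's individual curve, that observation's prediction at the chosen reference value, so that every curve passes through zero there. A minimal sketch with hypothetical toy curves rather than mlr objects:

```python
def center_ice(curves, grid, ref_value):
    """Subtract each curve's prediction at `ref_value` so all curves
    share a zero intercept there (the effect of the `center` argument)."""
    j = grid.index(ref_value)
    return [[y - curve[j] for y in curve] for curve in curves]

grid = [1.0, 2.0, 3.0]
curves = [[5.0, 6.0, 8.0],   # one toy ICE curve per observation
          [2.0, 2.5, 3.0]]

print(center_ice(curves, grid, ref_value=1.0))
# [[0.0, 1.0, 3.0], [0.0, 0.5, 1.0]]
```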

Plotting Partial Predictions

Results from generatePartialPredictionData can be visualized with plotPartialPrediction and plotPartialPredictionGGVIS.

With one feature and a regression task the output is a line plot, with a point at each location in the feature's grid.

plotPartialPrediction(pd.regr)


With a classification task, a line is drawn for each class, which gives the estimated partial probability of that class for a particular point in the feature grid.

plotPartialPrediction(pd.classif)


For regression tasks, when the fun argument of generatePartialPredictionData is used, the bounds will automatically be displayed using a gray ribbon.

plotPartialPrediction(pd.ci)


When multiple features are passed to generatePartialPredictionData but interaction = FALSE, facetting is used to display each estimated bivariate relationship.

plotPartialPrediction(pd.lst)


When interaction = TRUE in the call to generatePartialPredictionData, one feature must be chosen for facetting, and a subplot is created for each value in that feature's grid, showing the other feature's partial predictions at that value. Note that this type of plot is limited to two features.

plotPartialPrediction(pd.int, facet = "Petal.Length")


plotPartialPredictionGGVIS can be used similarly; however, since ggvis currently lacks subplotting/facetting capabilities, the argument interact maps one feature to an interactive sidebar where the user can select among that feature's grid values.

plotPartialPredictionGGVIS(pd.int, interact = "Petal.Length")

When individual = TRUE each individual conditional expectation curve is plotted.

plotPartialPrediction(pd.ind.regr)


When the individual curves are centered by subtracting the predictions made at a particular value of $X_s$, the curves share a fixed intercept, which aids in visualizing variation in the predictions made by $\hat{f}^{(i)}_{X_s}$.

plotPartialPrediction(pd.ind.classif)
